Here’s some exploratory plots for etort-100. “6-mer entropy” is the entropy of the distribution of 6mers in windows of length 500.
## a single tortoise
counts <- read.table("etort-100_test.counts.gz", header=TRUE)
major <- apply(counts,1,which.max)
coverages <- rowSums(counts)
corrects <- counts[ cbind(1:nrow(counts),major) ]
errors <- coverages-corrects
fasta <- do.call( paste, c( as.list(scan("gopherus_agassizii_mtgenome.fasta",skip=1,what="char",sep="")), list(sep='') ) )
chop <- function (string, win) {
nwins <- floor(nchar(string)/win)
breaks <- seq(1,by=win,length.out=nwins+1)
substring(string,first=breaks[-length(breaks)],last=breaks[-1]-1)
}
winlen <- 500
winbreaks <- seq(1,by=winlen,length.out=floor(nchar(fasta)/winlen)+1)
winmids <- winbreaks[-1] - diff(winbreaks)/2
klen <- 6
kmers <- lapply( lapply( chop(fasta,winlen), chop, win=klen ), table )
plogp <- function (p) { ifelse(p==0|p==1,0,-p*log(p)) }
entropies <- sapply( kmers, function (x) { sum(plogp(table(x)/sum(x))) } )
#pdf(file="etort-100_viz.pdf",width=10,height=10)
layout(matrix(c(1,2,1,3),nrow=2))
plot(coverages,type='l')
lines(errors,col='red')
lines(winmids,entropies*mean(coverages),col='green',pch=20,type='b')
legend("topleft",lty=1,col=c("black","red","green"),legend=c("coverage","nonmajor allele count","6mer entropy"))
plot(jitter(corrects),jitter(errors),xlab='major allele counts',ylab='nonmajor allele counts',pch=20,cex=0.5,col=adjustcolor("black",0.25))
abline(0,1)
acf(errors/coverages,lag.max=500)
#dev.off()
Read in the data:
counts <- as.matrix(read.table("mt272.counts.gz",header=TRUE))
nsamples <- ncol(counts)/4
bases <- c("A","C","G","T")
clist <- lapply(c(A=1,C=2,G=3,T=4),function(shift){counts[,shift+seq(0,by=4,length.out=nsamples)]})
stopifnot( length(tort.ids) == nsamples )
indivs <- tort.ids
colnames(clist$A) <- colnames(clist$C) <- colnames(clist$G) <- colnames(clist$T) <- indivs
coverage <- clist$A + clist$C + clist$G + clist$T
mean.coverage <- colMeans(coverage)
Here are the coverages:
barplot(do.call(rbind,lapply(clist,colSums)),las=2, main="nucleotide counts")
barplot(do.call(rbind,lapply(clist,function(x)colSums(x)/colSums(coverage))),las=2, main="nucleotide frequencies")
There’s one slightly wierd base composition in there, sample etort-97; the rest look good.
I’m going to omit the torts with mean coverage below 15 to make subsequent plots nicer.
use.torts <- (mean.coverage >= 15)
omitted.torts <- indivs[!use.torts]
cat("Omitting",paste(omitted.torts,collapse=' '),"\n")
## Omitting etort-2 etort-13 etort-24 etort-26 etort-33 etort-42 etort-45 etort-46 etort-48 etort-50 etort-52 etort-54 etort-56 (sheared2) etort-57 etort-59 etort-68 etort-70 etort-71 etort-72 etort-78 etort-81 etort-85 etort-95 etort-97 etort-102 etort-103 etort-105 etort-106 etort-107 etort-108 etort-109 etort-116 etort-117 etort-189 etort-191 etort-192 etort-193 etort-194 etort-197 etort-212 etort-227 etort-251 etort-265 etort-288 etort-289
counts <- counts[,rep(use.torts,each=4)]
coverage <- coverage[,use.torts]
for (k in 1:4) { clist[[k]] <- clist[[k]][,use.torts] }
indivs <- indivs[use.torts]
nsamples <- sum(use.torts)
mean.coverage <- mean.coverage[use.torts]
rm(use.torts)
The coverages per individual are pretty close to Poisson, but a bit overdispersed; here are plots for a few samples; lines are the Poisson distribution with matching mean; but total coverages are very overdispersed (last plot):
layout(t(1:2))
for (k in sample(seq_along(indivs),4)) {
plot( table(coverage[,k]), main=paste("coverages:", indivs[k]) )
lines(0:100,dpois(0:100,mean(coverage[,k]))*nrow(coverage),col='red')
}
layout(1)
ch <- hist( rowSums(coverage), breaks=50, main="total coverage" )
lines( ch$mids, diff( ppois( ch$breaks, mean(rowSums(coverage)) )*nrow(coverage) ), col='red' )
This is as we would expect since coverages between individuals, across sites, is correlated.
OK, let’s pick the major allele at each site, for each tortoise, then count up how many minors each tort has
total <- do.call(cbind,lapply(clist,rowSums))
total.coverage <- rowSums(coverage)
total.major.allele <- apply(total,1,which.max)
major.allele <- ifelse(
pmax(clist[[1]],clist[[2]]) >= pmax(clist[[3]],clist[[4]]),
ifelse( clist[[1]] >= clist[[2]], 1, 2 ),
ifelse( clist[[3]] >= clist[[4]], 3, 4 ) )
majors <- counts[ cbind( rep(1:nrow(counts),nsamples), as.vector(major.allele+4*(col(major.allele)-1)) ) ]
dim(majors) <- c( nrow(counts), nsamples )
colnames(majors) <- indivs
minors <- coverage-majors
minor.freqs <- rowSums(minors)/total.coverage
# give zero-coverage sites a major allele of NA
major.allele[majors==0] <- NA
# tabulate the nucleotide usage
table(bases[major.allele])
##
## A C G T
## 967782 483836 995068 1309312
Now, let’s get an idea of how many truly polymorphic sites there are, i.e., those having more than one different major allele across individuals. In total, there are 246 such sites, which is a proportion 0.0148658 of the total. For instance, here is the table of how many sites have how many individuals with a different major allele:
pander( table(rowSums(major.allele != total.major.allele)) )
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 10 | 11 | 13 | 17 | 18 | 19 | 34 | 78 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 16220 | 110 | 18 | 11 | 13 | 12 | 3 | 2 | 3 | 1 | 2 | 1 | 3 | 1 | 1 | 1 | 1 |
| 85 | 87 | 104 | 109 | 110 | 113 | 114 | 118 | 123 |
|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 4 | 11 | 1 | 1 | 6 | 30 | 1 |
Immediately, we see that there is (probably) a haplotype differing by 30 SNPs shared by 118 of the individuals. Let’s look at those 112 sites where only one individual had a different major allele. Are those mostly errors or not?
singletons <- which( rowSums(major.allele != total.major.allele) == 1 )
others <- sapply( singletons, function (k) { which(major.allele[k,]!=total.major.allele[k]) } )
singleton.coverages <- cbind( major=majors[cbind(singletons,others)], minor=minors[cbind(singletons,others)] )
plot(jitter(singleton.coverages,factor=0.5),main="major/minor coverages at singleton sites")
These mostly look pretty well supported, but I don’t know what’s going on with the sites with high coverage for two alleles. By individual, these break down as follows:
# frequency spectrum of singletons by individual
pander(table(table(names(others))))
| 1 | 2 | 3 | 4 | 6 |
|---|---|---|---|---|
| 48 | 14 | 8 | 1 | 1 |
For instance, 48 individuals have only one site at which their major allele differs from everyone else’s, while tortoise etort-195 has 6 such sites.
Now, feeling pretty good about assuming that the major allele at each site is the true site, except for maybe a couple of cases, let’s see how mismatches to the major allele look. First, let’s look at the error spectrum, i.e. the counts for each other base, split up by which is the major allele.
error.mat <- sapply( 1:4, function (k) {
sapply( 1:4, function (j) {
sum( clist[[k]][major.allele==j], na.rm=TRUE )
} )
} )
dimnames(error.mat) <- list( paste("major:",bases,sep=''), bases )
# raw counts
pander(error.mat)
| A | C | G | T | |
|---|---|---|---|---|
| major:A | 28021595 | 2661 | 29607 | 6489 |
| major:C | 5834 | 14084122 | 1269 | 21869 |
| major:G | 43753 | 3954 | 29101719 | 11017 |
| major:T | 7214 | 17909 | 14442 | 38077658 |
# probabilities
pander(error.mat/rowSums(error.mat))
| A | C | G | T | |
|---|---|---|---|---|
| major:A | 0.9986 | 0.00009483 | 0.001055 | 0.0002313 |
| major:C | 0.0004134 | 0.9979 | 0.00008992 | 0.00155 |
| major:G | 0.0015 | 0.0001356 | 0.998 | 0.0003778 |
| major:T | 0.0001893 | 0.0004698 | 0.0003789 | 0.999 |
Now, let’s see how these lie along the chromosome. Here’s a plot of coverage normalized by mean coverage for the tortoise, and then minor allele frequency, with sites with coverage less than 10 masked out:
matplot(sweep(coverage,2,mean.coverage,"/"),type='l',lty=1,col=adjustcolor(rainbow(32),0.5),main="coverage")
minorfreq <- minors/coverage
minorfreq[coverage<10] <- NA
matplot(minorfreq,type='l',lty=1,col=adjustcolor(rainbow(32),0.5),main="minor allele(s) frequency")
More usefully, here’s a plot of total minor allele frequency:
plot(rowSums(minors)/rowSums(coverage),pch=20)
abline(h=1/mean(coverage))
So, in certain regions, at certain sites, there are consistently the same alleles popping up in around 4% of the reads. Since we have nuclear coverage of about 1x, that’s about right for nuclear copies sneaking in (the horizontal line is at 1/mean(coverage) = 0.0343203).
Complementing this, here is a plot of the minor allele frequency against the proportion of individuals in which that site is polymorphic:
poly <- rowSums( coverage>0 & minors>0 )/rowSums( coverage>0 )
plot( minor.freqs, poly, xlab="minor allele frequency", ylab="proportion of polymorphic samples" )
This curve cries out for an explanation; it probably has to do with SNPs of varying frequency in the nuclear copies.
What proportion of the minor allele reads come from the nuclear copies? Well, here is the distribution of total number of minor alleles seen, both in counts and in frequencies.
layout(t(1:2))
plot( table( rowSums(minors) ), xlim=c(0,100), main="number of sites with minor frequency", xlab="minor allele count" )
plot( table( rowSums(minors) )/nrow(minors), xlim=c(0,100), main="frequency of sites", xlab="minor allele count" )
So, there are 4420 sites where all but one read agreed, which is 0.2671018 of the sites.
Let’s see how estimation of error rate is affected by those duplicates. The mean proportion of minor alleles is 0.0015168, while if we omit the sites with minor allele frequency above 1% (likely due to nuclear duplicates), thus dropping only a fraction 0.0370438 of the sites, this drops substantially to 0.0002949. So, the duplicated regions account for a substantial fraction of the read-based heterozygosity.
What to do for estimating the basic, per-read error rate? Well, there’s a region of 1Kb from 1801–2800 that doesn’t appear to have any mismapped reads:
goodregion <- seq(1801,2800)
plot(rowSums(minors)/rowSums(coverage),pch=20,xlim=range(goodregion))
muhat <- sum(minors[goodregion,])/sum(coverage[goodregion,])
… so we could get an overall estimate of per-read error rate from there. That works out to \[ \hat \mu = 0.0002392 , \] with a 95% confidence interval of 0.0002272, 0.0002512 which seems reassuringly low.
But, we’d really like to get at per-individual heterogeneity. Here’s the breakdown of total number of minor and major counts in this region by individual, along with the \(y = \hat \mu x\) line and the 0.1% and 99.9% quantiles for the Poisson(\(\hat \mu x\)) distributions.
goodcounts <- data.frame( coverage=colSums(coverage[goodregion,]), minor=colSums(minors[goodregion,]) )
cov.vals <- with(goodcounts, { seq(min(coverage),max(coverage),length.out=50) } )
quants <- sapply( c(.001,.999), function (q) { qpois( q, cov.vals*muhat ) } )
plot( minor ~ coverage, xlab="total coverage per individual", ylab="minor allele count per individual", data=goodcounts )
abline(0, muhat)
matlines( cov.vals, quants, lty=3, type='l', col='red')
That some of the 227 individuals fall outside of these quantiles indicates that there is (statistically) significant heterogeneity in error rate.
To check on the replicability of this, let’s do this for another good region, from 4500–5500:
goodregion.2 <- seq(4500,5500)
plot(rowSums(minors)/rowSums(coverage),pch=20,xlim=range(goodregion.2))
muhat.2 <- sum(minors[goodregion.2,])/sum(coverage[goodregion.2,])
Here we get \(\hat \mu = 0.0002222\), with a 95% confidence interval of 0.0002104, 0.000234 Now we can plot the error estimates per individual in the two regions against each other (the line is \(y=x\))
goodcounts.2 <- data.frame( coverage=colSums(coverage[goodregion.2,]), minor=colSums(minors[goodregion.2,]) )
plot( with(goodcounts,minor/coverage), with(goodcounts.2,minor/coverage), xlab="region 1", ylab="region 2" )
abline(0,1)
The correlation between them is 0.5476982 (and significant).